1 Outlier detection for high dimensional data
Charu C. Aggarwal, Philip S. Yu 



May 2001 ACM SIGMOD Record, Proceedings of the 2001 ACM SIGMOD international
conference on Management of data, volume 30 issue 2 

Full text available: pdf(197.25 KB) Additional Information: full citation, abstract, references, citings, index terms



The outlier detection problem has important applications in the field of fraud detection, 
network robustness analysis, and intrusion detection. Most such applications are high 
dimensional domains in which the data can contain hundreds of dimensions. Many recent 
algorithms use concepts of proximity in order to find outliers based on their relationship to 
the rest of the data. However, in high dimensional space, the data is sparse and the notion 
of proximity fails to retain its meaningfulness. ... 

2 Data streams I: Clustering binary data streams with K-means
Carlos Ordonez 

June 2003 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data 
mining and knowledge discovery 

Full text available: pdf(149.75 KB) Additional Information: full citation, abstract, references, citings

Clustering data streams is an interesting Data Mining problem. This article presents three 
variants of the K-means algorithm to cluster binary data streams. The variants include
On-line K-means, Scalable K-means, and Incremental K-means, a proposed variant introduced
that finds higher quality solutions in less time. Higher quality of solutions are obtained with 
a mean-based initialization and incremental learning. The speedup is achieved through a 
simplified set of sufficient statistics and oper ... 



3 Clustering: Document clustering via adaptive subspace iteration
Tao Li, Sheng Ma, Mitsunori Ogihara 

July 2004 Proceedings of the 27th annual international conference on Research and 
development in information retrieval 

Full text available: pdf(181.80 KB) Additional Information: full citation, abstract, references, index terms

Document clustering has long been an important problem in information retrieval. In this 
paper, we present a new clustering algorithm ASI, which uses explicit modeling of the
subspace structure associated with each cluster. ASI simultaneously performs data 
reduction and subspace identification via an iterative alternating optimization procedure. 
Motivated from the optimization procedure, we then provide a novel method to determine 
the number of clusters. We also disc ...

Keywords: adaptive subspace identification, alternating optimization, document clustering,
factor analysis 



4 Using approximations to scale exploratory data analysis in datacubes
Daniel Barbara, Xintao Wu 

August 1999 Proceedings of the fifth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(607.88 KB) Additional Information: full citation, references, citings, index terms



5 Identify Regions of Interest (ROI) for video watermark embedment with principle
component analysis
Roy Wang, Qiang Cheng, Thomas Huang 

October 2000 Proceedings of the eighth ACM international conference on Multimedia 

Full text available: pdf(271.65 KB) Additional Information: full citation, abstract, references, index terms

The temporal redundancy of video provides a greater space than images for information 
hiding at the expense of invitation towards many forms of spatial and temporal attacks, 
such as frame dropping, frame averaging that are not common in images. With video, the 
active change of watermark placement location serves as an effective counterattack 
measure. In this paper, we utilize principal components of joint feature observation of video 
frames to robustly determine the location of watermark embe ... 

Keywords: PCA, Principal Component Analysis, Region of Interest, clustering, video 
watermarking

6 Efficient algorithms for mining outliers from large data sets
Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim 

May 2000 ACM SIGMOD Record, Proceedings of the 2000 ACM SIGMOD international
conference on Management of data, volume 29 issue 2 

Full text available: pdf(180.17 KB) Additional Information: full citation, abstract, references, citings, index terms

In this paper, we propose a novel formulation for distance-based outliers that is based on 
the distance of a point from its kth nearest neighbor. We rank each point on the basis of its
distance to its kth nearest neighbor and declare the top n points in this ranking to be
outliers. In addition to developing relatively straightforward solutions to finding such 
outliers based on the classical nested-loop join and index join algorithms, we develo ... 
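
The ranking described in this abstract can be illustrated with a small brute-force sketch. It is not the authors' implementation: the function names kth_nn_distance and top_n_outliers are hypothetical, Euclidean distance is assumed, and the quadratic scan below stands in for the nested-loop and index-based algorithms the abstract mentions.

```python
import heapq
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (itself excluded)."""
    dists = (math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    # nsmallest returns the k smallest distances in ascending order,
    # so the last element is the distance to the k-th nearest neighbor.
    return heapq.nsmallest(k, dists)[-1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th nearest neighbor distance."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    # Rank by score, largest first, and keep the top n indices.
    return [i for _, i in heapq.nlargest(n, scores)]


# Toy data (an assumption for illustration): four clustered points, one far away.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_n_outliers(data, k=2, n=1))  # prints [4]
```

The scan costs O(N^2) distance computations for N points, which is acceptable only for small inputs.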

7 Solving the occlusion problem for three-dimensional distortion-oriented displays
Donovan Winch, Paul Calder, Raymond Smith 

January 2001 Australian Computer Science Communications, Proceedings of the 2nd
Australasian conference on User interface, volume 23 issue 5 

Full text available: pdf(1.60 MB), Publisher Site
Additional Information: full citation, abstract, references

Recent research into distortion-oriented displays (DODs) and non-linear magnification 
techniques has considered extending their application to large three-dimensional datasets. 
Inherent properties of three-dimensional datasets introduce some difficulties that do not 
occur in 2D environments. This paper considers the Occlusion Problem - that of context 
data hiding, or occluding, some or all of the data within an area of focus. A novel solution to 
this problem is proposed, namely the use of non-ge ...

8 Think globally, fit locally: unsupervised learning of low dimensional manifolds
Lawrence K. Saul, Sam T. Roweis 

December 2003 The Journal of Machine Learning Research, volume 4 

Full text available: pdf(2.91 MB) Additional Information: full citation, abstract, index terms

The problem of dimensionality reduction arises in many fields of information processing, 
including machine learning, data compression, scientific visualization, pattern recognition, 
and neural computation. Here we describe locally linear embedding (LLE), an unsupervised 
learning algorithm that computes low dimensional, neighborhood preserving embeddings of 
high dimensional data. The data, assumed to be sampled from an underlying manifold, are 
mapped into a single global coordinate system of lowe ... 

9 Learning response time for WebSources using query feedback and application in query
optimization

Jean-Robert Gruser, Louiqa Raschid, Vladimir Zadorozhny, Tao Zhan 

March 2000 The VLDB Journal — The International Journal on Very Large Data Bases,
volume 9 issue 1

Full text available: pdf(625.36 KB) Additional Information: full citation, abstract, index terms

The rapid growth of the Internet and support for interoperability protocols has increased the 
number of Web accessible sources, WebSources. Current wrapper mediator architectures 
need to be extended with a wrapper cost model (WCM) for WebSources that can estimate 
the response time (delays) to access sources as well as other relevant statistics. In this 
paper, we present a Web prediction tool (WebPT), a tool that is based on learning using 
query feedback from WebSources. The WebPT uses dimensions ... 

Keywords: Data-intensive applications on the Web, Query languages and systems for Web 
data 



10 Subspace clustering for high dimensional data: a review
Lance Parsons, Ehtesham Haque, Huan Liu 

June 2004 ACM SIGKDD Explorations Newsletter, volume 6 issue 1

Full text available: pdf(539.13 KB) Additional Information: full citation, abstract, references

Subspace clustering is an extension of traditional clustering that seeks to find clusters in 
different subspaces within a dataset. Often in high dimensional data, many dimensions are 
irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant 
and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms 
localize the search for relevant dimensions allowing them to find clusters that exist in 
multiple, possibly overlapping subspaces. The ... 

Keywords: clustering survey, high dimensional data, projected clustering, subspace 
clustering 



11 Clustering through decision tree construction 
Bing Liu, Yiyuan Xia, Philip S. Yu 

November 2000 Proceedings of the ninth international conference on Information and 
knowledge management 

Full text available: pdf(280.62 KB) Additional Information: full citation, references, citings, index terms

12 Data clustering: Opening the black box: interactive hierarchical clustering for
multivariate spatial patterns
Diansheng Guo, Donna Peuquet, Mark Gahegan

November 2002 Proceedings of the tenth ACM international symposium on Advances in 
geographic information systems 

Full text available: pdf(272.07 KB) Additional Information: full citation, references, citings, index terms

Clustering is one of the most important tasks for geographic knowledge discovery. 
However, existing clustering methods have two severe drawbacks for this purpose. First, 
spatial clustering methods have so far been mainly focused on searching for patterns within 
the spatial dimensions (usually 2D or 3D space), while more general-purpose
high-dimensional (multivariate) clustering methods have very limited power in recognizing
spatial patterns that involve neighbors. Secondly, existing clustering m ... 

Keywords: geographic knowledge discovery, hierarchical subspace clustering, spatial 
ordering, visualization and interaction 



13 Clustering algorithms: FREM: fast and robust EM clustering for large data sets
Carlos Ordonez, Edward Omiecinski 

November 2002 Proceedings of the eleventh international conference on Information
and knowledge management

Full text available: pdf(200.82 KB) Additional Information: full citation, abstract, references, citings, index terms

Clustering is a fundamental Data Mining technique. This article presents an improved EM 
algorithm to cluster large data sets having high dimensionality, noise and zero variance 
problems. The algorithm incorporates improvements to increase the quality of solutions and 
speed. In general the algorithm can find a good clustering solution in 3 scans over the data 
set. Alternatively, it can be run until it converges. The algorithm has a few parameters that 
are easy to set and have defaults for most ca ... 

Keywords: EM, clustering, data mining 



14 Special issue on independent components analysis: Energy-based models for sparse
overcomplete representations
Yee Whye Teh, Max Welling, Simon Osindero, Geoffrey E. Hinton 
December 2003 The Journal of Machine Learning Research, volume 4 

Full text available: pdf(591.75 KB) Additional Information: full citation, abstract, index terms

We present a new way of extending independent components analysis (ICA) to 
overcomplete representations. In contrast to the causal generative extensions of ICA which 
maintain marginal independence of sources, we define features as deterministic (linear) 
functions of the inputs. This assumption results in marginal dependencies among the 
features, but conditional independence of the features given the inputs. By assigning 
energies to the features a probability d ... 



15 Fast detection of communication patterns in distributed executions 
Thomas Kunz, Michiel F. H. Seuren 

November 1997 Proceedings of the 1997 conference of the Centre for Advanced Studies 
on Collaborative research 

Full text available: pdf(4.21 MB) Additional Information: full citation, abstract, references, index terms

Understanding distributed applications is a tedious and difficult task. Visualizations based on 
process-time diagrams are often used to obtain a better understanding of the execution of 
the application. The visualization tool we use is Poet, an event tracer developed at the 
University of Waterloo. However, these diagrams are often very complex and do not provide
the user with the desired overview of the application. In our experience, such tools display 
repeated occurrences of non-trivial commun ... 

16 Knowledge discovery in data warehouses 
Themistoklis Palpanas 

September 2000 ACM SIGMOD Record, volume 29 issue 3 

Full text available: pdf(240.77 KB) Additional Information: full citation, abstract, index terms

As the size of data warehouses increase to several hundreds of gigabytes or terabytes, the 
need for methods and tools that will automate the process of knowledge extraction, or 
guide the user to subsets of the dataset that are of particular interest, is becoming 
prominent. In this survey paper we explore the problem of identifying and extracting 
interesting knowledge from large collections of data residing in data warehouses, by using 
data mining techniques. Such techniques have the ability to i ... 

17 Compressed data cubes for OLAP aggregate query approximation on continuous 
dimensions 

Jayavel Shanmugasundaram, Usama Fayyad, P. S. Bradley 

August 1999 Proceedings of the fifth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(1.12 MB) Additional Information: full citation, references, citings, index terms



Keywords: OLAP, approximate query answering, clustering, data cubes, data mining, 
density estimation 




18 Research track posters: Diagnosing extrapolation: tree-based density estimation
Giles Hooker 

August 2004 Proceedings of the 2004 ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(378.85 KB) Additional Information: full citation, abstract, references, index terms

There has historically been very little concern with extrapolation in Machine Learning, yet 
extrapolation can be critical to diagnose. Predictor functions are almost always learned on a 
set of highly correlated data comprising a very small segment of predictor space. Moreover, 
flexible predictors, by their very nature, are not controlled at points of extrapolation. This 
becomes a problem for diagnostic tools that require evaluation on a product distribution. It 
is also an issue when we are tryin ... 

Keywords: C4.5, CART, clustering, density estimation, diagnostics, extrapolation, 
interpretation, modeling methodologies, trees-based models, visualization 



19 Poster papers: A unifying framework for detecting outliers and change points from
non-stationary time series data
Kenji Yamanishi, Jun-ichi Takeuchi 

July 2002 Proceedings of the eighth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(572.91 KB) Additional Information: full citation, abstract, references, index terms

We are concerned with the issues of outlier detection and change point detection from a 
data stream. In the area of data mining, there have been increased interest in these issues 
since the former is related to fraud detection, rare event discovery, etc., while the latter is 
related to event/trend by change detection, activity monitoring, etc. Specifically, it is
important to consider the situation where the data source is non-stationary, since the 
nature of data source may change over time in r ... 

20 Similarity Search: Effective nearest neighbor indexing with the euclidean metric 
Sang-Wook Kim, Charu C. Aggarwal, Philip S. Yu 

October 2001 Proceedings of the tenth international conference on Information and 
knowledge management 

Full text available: pdf(2.18 MB) Additional Information: full citation, abstract, references, index terms

The nearest neighbor search is an important operation widely-used in multimedia 
databases. In higher dimensions, most of previous methods for nearest neighbor search 
become inefficient and require to compute nearest neighbor distances to a large fraction of 
points in the space. In this paper, we present a new approach for processing nearest 
neighbor search with the Euclidean metric, which searches over only a small subset of the 
original space. This approach effectively approximates clusters by ... 

Keywords: Euclidean metric, high dimensional indexes, multimedia databases, nearest 
neighbor queries, similarity search 